Training Neural Networks with Keras

Goals:

  • Train a first neural network with TensorFlow, using the Keras layers API

Dataset:

  • the digits dataset from scikit-learn: 8×8 grayscale images of handwritten digits (10 classes)


In [ ]:
%matplotlib inline 
# display figures in the notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

In [ ]:
sample_index = 45
plt.figure(figsize=(3, 3))
plt.imshow(digits.images[sample_index], cmap=plt.cm.gray_r,
           interpolation='nearest')
plt.title("image label: %d" % digits.target[sample_index]);

Train / Test Split

Let's keep some held-out data to be able to measure the generalization performance of our model.


In [ ]:
from sklearn.model_selection import train_test_split


data = np.asarray(digits.data, dtype='float32')
target = np.asarray(digits.target, dtype='int32')

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.15, random_state=37)

Preprocessing of the Input Data

Make sure that all input variables are approximately on the same scale via input normalization:


In [ ]:
from sklearn import preprocessing


# mean = 0 ; standard deviation = 1.0
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# print(scaler.mean_)
# print(scaler.scale_)

Let's display one of the transformed samples (after feature standardization):


In [ ]:
sample_index = 45
plt.figure(figsize=(3, 3))
plt.imshow(X_train[sample_index].reshape(8, 8),
           cmap=plt.cm.gray_r, interpolation='nearest')
plt.title("transformed sample\n(standardization)");

The scaler object makes it possible to recover the original sample:


In [ ]:
plt.figure(figsize=(3, 3))
# inverse_transform expects a 2D array, hence the reshape to (1, -1)
plt.imshow(scaler.inverse_transform(X_train[sample_index].reshape(1, -1)).reshape(8, 8),
           cmap=plt.cm.gray_r, interpolation='nearest')
plt.title("original sample");

In [ ]:
print(X_train.shape, y_train.shape)

In [ ]:
print(X_test.shape, y_test.shape)

Preprocessing of the Target Data

To train our first neural network we also need to turn the target variable into a one-hot encoded vector representation. Here are the labels of the first samples in the training set encoded as integers:


In [ ]:
y_train[:3]

Keras provides a utility function to convert integer-encoded categorical variables into one-hot encoded values:


In [ ]:
from tensorflow.keras.utils import to_categorical

Y_train = to_categorical(y_train)
Y_train[:3]

Feed Forward Neural Networks with Keras

A First Keras Model

We can now build and train our first feed forward neural network using the high-level Keras API:

  • first we define the model by stacking layers with the right dimensions
  • then we define a loss function and plug in the SGD optimizer
  • then we feed the model the training data for a fixed number of epochs

In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras import optimizers

input_dim = X_train.shape[1]
hidden_dim = 100
output_dim = 10

model = Sequential()
model.add(Dense(hidden_dim, input_dim=input_dim, activation="tanh"))
model.add(Dense(output_dim, activation="softmax"))

model.compile(optimizer=optimizers.SGD(learning_rate=0.1),
              loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, Y_train, validation_split=0.2, epochs=15, batch_size=32)

Visualizing the Convergence


In [ ]:
history.history

In [ ]:
history.epoch

Let's wrap this into a pandas dataframe for easier plotting:


In [ ]:
import pandas as pd

history_df = pd.DataFrame(history.history)
history_df["epoch"] = history.epoch
history_df

In [ ]:
fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(12, 6))
history_df.plot(x="epoch", y=["loss", "val_loss"], ax=ax0)
history_df.plot(x="epoch", y=["accuracy", "val_accuracy"], ax=ax1);

Monitoring Convergence with TensorBoard

TensorBoard is TensorFlow's built-in tool for monitoring the training of neural networks.


In [ ]:
%load_ext tensorboard

In [ ]:
!rm -rf tensorboard_logs

In [ ]:
import datetime
from tensorflow.keras.callbacks import TensorBoard

model = Sequential()
model.add(Dense(hidden_dim, input_dim=input_dim, activation="tanh"))
model.add(Dense(output_dim, activation="softmax"))

model.compile(optimizer=optimizers.SGD(learning_rate=0.1),
              loss='categorical_crossentropy', metrics=['accuracy'])

timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
log_dir = "tensorboard_logs/" + timestamp
tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)

model.fit(x=X_train, y=Y_train, validation_split=0.2, epochs=15,
          callbacks=[tensorboard_callback]);

In [ ]:
%tensorboard --logdir tensorboard_logs

Exercises: Impact of the Optimizer

  • Try to decrease the learning rate value by a factor of 10 or 100. What do you observe?

  • Try to increase the learning rate value to make the optimization diverge.

  • Configure the SGD optimizer to enable a Nesterov momentum of 0.9 (a minimal sketch follows right below).
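
As a starting point for the last item, here is a minimal sketch of the same model compiled with an SGD optimizer configured with a Nesterov momentum of 0.9 (reusing input_dim, hidden_dim and output_dim defined above; the loaded solution may differ):


In [ ]:
# Sketch: same architecture as above, SGD with Nesterov momentum of 0.9
model = Sequential()
model.add(Dense(hidden_dim, input_dim=input_dim, activation="tanh"))
model.add(Dense(output_dim, activation="softmax"))

model.compile(optimizer=optimizers.SGD(learning_rate=0.1, momentum=0.9,
                                       nesterov=True),
              loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, Y_train, validation_split=0.2, epochs=15,
                    batch_size=32)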

Notes:

The Keras API documentation is available at:

https://www.tensorflow.org/api_docs/python/tf/keras

It is also possible to learn more about the parameters of a class by using the question mark suffix. Type and evaluate:

optimizers.SGD?

in a jupyter notebook cell.

It is also possible to type the beginning of a function call / constructor and press "shift-tab" after the opening parenthesis:

optimizers.SGD(<shift-tab>

In [ ]:
optimizers.SGD?

In [ ]:


In [ ]:
# %load solutions/keras_sgd_and_momentum.py

  • Replace the SGD optimizer by the Adam optimizer from Keras and run it with the default parameters.

    Hint: use optimizers.<TAB> to tab-complete the list of optimizers implemented in Keras.

  • Add another hidden layer and use the "Rectified Linear Unit" activation for each hidden layer. Can you still train the model with Adam and its default global learning rate? (A minimal sketch follows below.)
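
As a starting point, here is a minimal sketch of a network with two "relu" hidden layers trained with the Adam optimizer and its default parameters (reusing the dimensions defined above; the loaded solution may differ):


In [ ]:
# Sketch: two ReLU hidden layers, trained with Adam and its default
# learning rate.
model = Sequential()
model.add(Dense(hidden_dim, input_dim=input_dim, activation="relu"))
model.add(Dense(hidden_dim, activation="relu"))
model.add(Dense(output_dim, activation="softmax"))

model.compile(optimizer=optimizers.Adam(),
              loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, Y_train, validation_split=0.2, epochs=15,
                    batch_size=32)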


In [ ]:


In [ ]:
# %load solutions/keras_adam.py

Exercises: Forward Pass and Generalization

  • Compute predictions on the test set using model.predict(...) and take the argmax over the class dimension to get the predicted labels.
  • Compute the average accuracy of the model on the test set: the fraction of test samples for which the model makes a prediction that matches the true label.

In [ ]:


In [ ]:
# %load solutions/keras_accuracy_on_test_set.py

NumPy Arrays vs TensorFlow Tensors

In the previous exercise we used model.predict(...), which returns a numpy array of per-class probabilities. Taking the argmax along the class axis gives the predicted labels:


In [ ]:
predicted_labels_numpy = model.predict(X_test).argmax(axis=1)
predicted_labels_numpy

In [ ]:
type(predicted_labels_numpy), predicted_labels_numpy.shape

Alternatively, one can directly call the model on the data to get the last layer (softmax) outputs as a TensorFlow Tensor:


In [ ]:
predictions_tf = model(X_test)
predictions_tf[:5]

In [ ]:
type(predictions_tf), predictions_tf.shape

We can use the tensorflow API to check that for each row, the probabilities sum to 1:


In [ ]:
import tensorflow as tf

tf.reduce_sum(predictions_tf, axis=1)[:5]

We can also extract the label with the highest probability using the tensorflow API:


In [ ]:
predicted_labels_tf = tf.argmax(predictions_tf, axis=1)
predicted_labels_tf[:5]

We can compare those labels to the expected labels to compute the accuracy with the TensorFlow API. Note however that we need an explicit cast from boolean to floating point values to be able to compute the mean accuracy when using TensorFlow tensors:


In [ ]:
accuracy_tf = tf.reduce_mean(tf.cast(predicted_labels_tf == y_test, tf.float64))
accuracy_tf

Also note that it is possible to convert tensors to numpy arrays if one prefers to use numpy:


In [ ]:
accuracy_tf.numpy()

In [ ]:
predicted_labels_tf[:5]

In [ ]:
predicted_labels_tf.numpy()[:5]

In [ ]:
(predicted_labels_tf.numpy() == y_test).mean()

Home Assignment: Impact of Initialization

Let us now study the impact of a bad initialization when training a deep feed forward network.

By default Keras dense layers use the "Glorot Uniform" initialization strategy to initialize the weight matrices:

  • each weight coefficient is randomly sampled from a uniform distribution over [-scale, scale]
  • scale is proportional to $\frac{1}{\sqrt{n_{in} + n_{out}}}$

This strategy is known to work well to initialize deep neural networks with "tanh" or "relu" activation functions and then trained with standard SGD.
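
As a quick sanity check, here is a minimal sketch (assuming the same 64-input, 100-unit first layer as above) that samples a Glorot-uniform weight matrix and compares its empirical standard deviation with the theoretical value $\sqrt{2 / (n_{in} + n_{out})}$ (the uniform limit $\sqrt{6 / (n_{in} + n_{out})}$ divided by $\sqrt{3}$):


In [ ]:
# Sketch: compare the empirical std of a Glorot-uniform sample with the
# theoretical value sqrt(2 / (n_in + n_out)).
from tensorflow.keras import initializers

n_in, n_out = input_dim, hidden_dim
glorot_init = initializers.GlorotUniform()
w_glorot = glorot_init(shape=(n_in, n_out)).numpy()

print("empirical std:  ", w_glorot.std())
print("theoretical std:", np.sqrt(2 / (n_in + n_out)))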

To assess the impact of initialization, let us plug an alternative init scheme into a network with two hidden layers and "tanh" activations. For the sake of the example, let's use normally distributed weights with a manually adjustable scale (standard deviation) and see the impact of the scale value:


In [ ]:
from tensorflow.keras import initializers

normal_init = initializers.TruncatedNormal(stddev=0.01)


model = Sequential()
model.add(Dense(hidden_dim, input_dim=input_dim, activation="tanh",
                kernel_initializer=normal_init))
model.add(Dense(hidden_dim, activation="tanh",
                kernel_initializer=normal_init))
model.add(Dense(output_dim, activation="softmax",
                kernel_initializer=normal_init))

model.compile(optimizer=optimizers.SGD(learning_rate=0.1),
              loss='categorical_crossentropy', metrics=['accuracy'])

In [ ]:
model.layers

Let's have a look at the parameters of the first layer after initialization but before any training has happened:


In [ ]:
model.layers[0].weights

In [ ]:
w = model.layers[0].weights[0].numpy()
w

In [ ]:
w.std()

In [ ]:
b = model.layers[0].weights[1].numpy()
b

In [ ]:
history = model.fit(X_train, Y_train, epochs=15, batch_size=32)

plt.figure(figsize=(12, 4))
plt.plot(history.history['loss'], label="Truncated Normal init")
plt.legend();

Once the model has been fit, the weights have been updated; notably, the biases are no longer 0:


In [ ]:
model.layers[0].weights

Questions:

  • Try the following initialization schemes and see whether the SGD algorithm can successfully train the network or not:

    • a very small scale, e.g. stddev=1e-3
    • a larger scale, e.g. stddev=1 or 10
    • all weights initialized to 0 (constant initialization)
  • What do you observe? Can you find an explanation for those outcomes?

  • Are more advanced solvers such as SGD with momentum or Adam better able to deal with such bad initializations? (A minimal sketch for running these experiments follows below.)
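
As a starting point, here is a minimal sketch using a hypothetical helper, fit_with_init, that rebuilds and fits the same three-layer architecture with a given initializer and optimizer, so that several schemes can be compared (the loaded solutions may differ):


In [ ]:
# Sketch: hypothetical helper to rebuild and fit the same architecture
# with a given initializer and optimizer.
def fit_with_init(init, optimizer, epochs=15):
    model = Sequential()
    model.add(Dense(hidden_dim, input_dim=input_dim, activation="tanh",
                    kernel_initializer=init))
    model.add(Dense(hidden_dim, activation="tanh",
                    kernel_initializer=init))
    model.add(Dense(output_dim, activation="softmax",
                    kernel_initializer=init))
    model.compile(optimizer=optimizer, loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model.fit(X_train, Y_train, epochs=epochs, batch_size=32, verbose=0)


# Example: very small init scale, plain SGD vs Adam.
h_sgd = fit_with_init(initializers.TruncatedNormal(stddev=1e-3),
                      optimizers.SGD(learning_rate=0.1))
h_adam = fit_with_init(initializers.TruncatedNormal(stddev=1e-3),
                       optimizers.Adam())

plt.figure(figsize=(12, 4))
plt.plot(h_sgd.history['loss'], label="stddev=1e-3, SGD")
plt.plot(h_adam.history['loss'], label="stddev=1e-3, Adam")
plt.legend();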


In [ ]:


In [ ]:
# %load solutions/keras_initializations.py

In [ ]:
# %load solutions/keras_initializations_analysis.py